perf(github): Cache accessible repos for accessibleOnly search by jaydgoss · Pull Request #112548 · getsentry/sentry

jaydgoss · 2026-04-08T22:29:09Z

Summary

The OrganizationIntegrationReposEndpoint (/integrations/{id}/repos/) lets the frontend search for GitHub repos available to a GitHub App installation. When called with accessibleOnly=true and a search query (as the SCM onboarding repo selector does on each debounced keystroke), the previous implementation fetched all installation-accessible repos from the GitHub API (up to 50 pages of 100 = 5,000 repos) on every request, then filtered them with a Python list comprehension.
Cache the full repo list in sentry.cache.default_cache (Redis) for 5 minutes and filter locally on subsequent requests, reducing each typed query from O(pages) GitHub API calls to zero on a cache hit.
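
A minimal sketch of the cache-then-filter idea described in this summary, not the code in this PR. `TTLCache`, `search_accessible_repos`, the cache key format, and the `fetch_all_repos` callable are hypothetical names chosen for illustration; an in-process dict with expiry timestamps stands in for the shared cache backend.

```python
from __future__ import annotations

import time
from typing import Any, Callable

Repo = dict[str, Any]


class TTLCache:
    """Tiny in-process stand-in for a shared cache with per-key expiry."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, list[Repo]]] = {}

    def get(self, key: str) -> list[Repo] | None:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: list[Repo], ttl: float) -> None:
        self._store[key] = (time.monotonic() + ttl, value)


def search_accessible_repos(
    cache: TTLCache,
    integration_id: int,
    query: str,
    fetch_all_repos: Callable[[], list[Repo]],
    ttl: float = 300,
) -> list[Repo]:
    """Fetch the full repo list at most once per TTL, then filter locally."""
    key = f"repos:{integration_id}"  # hypothetical key format
    repos = cache.get(key)
    if repos is None:
        repos = fetch_all_repos()  # the expensive, paginated call
        cache.set(key, repos, ttl)
    needle = query.lower()
    return [r for r in repos if needle in r["full_name"].lower()]
```

With this shape, only the first keystroke inside the TTL window pays for the paginated fetch; later keystrokes are a cache read plus a list scan.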

Test plan

Existing get_repositories tests pass (6/6)
New test_get_repositories_accessible_only_caches_repos verifies cache hit path skips /installation/repositories calls
Manual testing: second accessibleOnly search returns instantly from cache
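
For illustration, a cache-hit test along these lines could look like the sketch below. It is not the test added in this PR: `get_repos_cached` here is a hypothetical free function over a plain dict, and the client is a `unittest.mock.Mock` rather than a real GitHub client.

```python
from unittest.mock import Mock


def get_repos_cached(client, cache: dict, key: str = "repos"):
    """Hypothetical helper: return cached repos, fetching once on a miss."""
    if key not in cache:
        cache[key] = client.get_repos()
    return cache[key]


def test_cache_hit_skips_fetch() -> None:
    client = Mock()
    client.get_repos.return_value = [{"full_name": "acme/api"}]
    cache: dict = {}

    first = get_repos_cached(client, cache)   # miss: calls the client
    second = get_repos_cached(client, cache)  # hit: served from the dict

    assert first == second == [{"full_name": "acme/api"}]
    # The second lookup must not have reached the client.
    client.get_repos.assert_called_once()
```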

Refs VDY-68

When accessibleOnly=true with a search query, the old path fetched all installation repos (up to 5,000) on every debounced keystroke, then filtered with a Python list comprehension. Replace this with a cached set of accessible repo IDs (5-min Redis TTL) combined with the GitHub Search API, reducing each typed query from O(pages) API calls to a single search call plus a Redis lookup. Refs VDY-68

…h API Switch from Search API + cached ID set to caching the full repo list and filtering locally. This avoids the Search API's shared 30 req/min rate limit and uses sentry.cache.default_cache (Redis-backed) instead of django.core.cache (DummyCache in Sentry). Refs VDY-68

linear-code · 2026-04-08T22:29:13Z

VDY-68 perf(github): Optimize accessibleOnly repo search to avoid fetching all pages per keystroke

Keep the cached repo list unfiltered so the cache is a faithful snapshot of the GitHub API response. Apply the archived filter in get_repositories alongside the other transforms. Also let the accessible_only path handle both with and without a query. Refs VDY-68

The Search API does not return archived repos, so the archived filter should only apply to the /installation/repositories paths.

Move no-query path first since accessible_only is only useful with a query (repeated keystrokes). Combine archived and query filters into a single pass through to_repo_info.
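
A sketch of what combining the archived and query filters into one pass might look like. The function name `filter_repos` and the output shape (`name`, `identifier`, `default_branch`) are assumptions for illustration; the actual `to_repo_info` helper is not shown on this page.

```python
from __future__ import annotations

from collections.abc import Iterable, Iterator
from typing import Any


def filter_repos(
    repos: Iterable[dict[str, Any]], query: str | None = None
) -> Iterator[dict[str, Any]]:
    """Drop archived repos and non-matching names in a single pass."""
    needle = query.lower() if query else None
    for repo in repos:
        if repo.get("archived"):
            continue
        if needle is not None and needle not in repo["full_name"].lower():
            continue
        # Assumed output shape; the real helper may differ.
        yield {
            "name": repo["name"],
            "identifier": repo["full_name"],
            "default_branch": repo.get("default_branch"),
        }
```

Accepting an `Iterable` rather than a `Sequence` lets callers pass a generator or a list without an intermediate copy.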

Strip raw GitHub repo dicts down to the 5 fields used by get_repositories before storing in the cache. Reduces per-integration cache size from ~3KB per repo to ~100 bytes.
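
As a rough illustration of the trimming step, the sketch below keeps only the five fields named in the diff and compares serialized sizes. `to_cached_repo` is a hypothetical helper name, and the sample payload is a small made-up subset of a GitHub repository object, so the sizes it produces do not reproduce the figures quoted above.

```python
from __future__ import annotations

import json
from typing import Any, TypedDict


class CachedRepo(TypedDict):
    id: int
    name: str
    full_name: str
    default_branch: str | None
    archived: bool | None


def to_cached_repo(raw: dict[str, Any]) -> CachedRepo:
    """Keep only the fields that are read back after a cache hit."""
    return {
        "id": raw["id"],
        "name": raw["name"],
        "full_name": raw["full_name"],
        "default_branch": raw.get("default_branch"),
        "archived": raw.get("archived"),
    }


# Made-up sample; real API responses carry many more keys than this.
raw_repo = {
    "id": 1,
    "name": "api",
    "full_name": "acme/api",
    "default_branch": "main",
    "archived": False,
    "description": "Example service",
    "html_url": "https://github.com/acme/api",
    "owner": {"login": "acme", "id": 42},
}

trimmed = to_cached_repo(raw_repo)
raw_size = len(json.dumps(raw_repo))
trimmed_size = len(json.dumps(trimmed))
```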

getsentry configures CACHES with memcached in production, so django.core.cache.cache works and matches the pattern used by the rest of the integrations codebase.

wedamija · 2026-04-09T17:17:17Z

src/sentry/integrations/github/client.py

+    def get_accessible_repos_cached(self, ttl: int = 300) -> list[CachedRepo]:
+        """
+        Return all repos accessible to this installation.
+        Cached in Django cache for ``ttl`` seconds so that debounced
+        search keystrokes don't re-fetch all pages from GitHub.
+
+        Only the fields used by get_repositories() are stored to keep
+        the cache payload small.
+        """
+        cache_key = f"github:accessible_repos:{self.integration.id}"
+        cached = cache.get(cache_key)
+        if cached is not None:
+            return cached
+
+        all_repos = self.get_repos()
+        repos: list[CachedRepo] = [
+            {
+                "id": r["id"],
+                "name": r["name"],
+                "full_name": r["full_name"],
+                "default_branch": r.get("default_branch"),
+                "archived": r.get("archived"),
+            }
+            for r in all_repos
+        ]
+        cache.set(cache_key, repos, ttl)
+        return repos


This isn't really get_accessible_repos_cached, it's get_repos_cached

wedamija · 2026-04-09T17:18:10Z

src/sentry/integrations/github/integration.py

+                r
+                for r in all_repos_cached
+                if not r.get("archived") and query_lower in r["full_name"].lower()
+            )


I think that rather than making the cache use conditional on accessible_only, we should have use_cache=False. We should have an assert like assert not use_cache or not query since the cache doesn't work with the query

A little confused by this comment because the cache is set up to work with the accessible_only variant of querying.

accessible_only is a means to query accessible repos; the default query behavior was exposing repos that Sentry does not necessarily have access to (public repos not selected during installation configuration).

The cache in its current form is only enabled for the accessible_only path because I didn't want to alter behavior for existing consumers.

That said, the cache could instead be opt-in and applied to either the no-query path or the accessible_only path. Is that along the lines of your thinking?

Further changes here are introduced with pagination support #112591

My main point is that it doesn't really make sense to make the cache specific to accessible only. We're caching all the repos, and then we're filtering to accessible only in Python. So if we add use_cache=False, then in your usages you pass use_cache=True, is_accessible=True. That's more general and allows us to use the cache in other places later, if we want to. In general, I feel like Claude tends to do things in an overly specific way, so I just wanted to guide us away from that here.

Okay, I will alter things so that the cache can be applied to either the no-query or accessible_only paths.

Filtering to accessible repos is not handled via Python filtering (it's a different GitHub API path); the "archived" filter is kind of a red herring there.

Accessible vs. non-accessible:

accessible repo fetch: client.get_repos(
accessible-agnostic repo search: client.search_repositories(

Oh sorry, I misread here... although I'm confused about the split that exists in the current code because it looks like it fetches all repos from github and then filters out archived repos when it's accessible.

It seems like if there's no query, we should always be using self.get_client().get_repos? I'm not totally sure why we'd use the search api when we're fetching everything

Ok, discussed on slack and I understand what's going on a bit better here now.

Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.

Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.
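
One way to read the outcome of this thread is sketched below: caching becomes an explicit opt-in that is independent of `accessible_only`. This is an illustrative free function over a hypothetical `RepoClient` protocol, not the method merged in this PR; the three client method names come from the discussion, but their exact signatures are assumed.

```python
from __future__ import annotations

from typing import Any, Protocol

Repo = dict[str, Any]


class RepoClient(Protocol):
    # Method names taken from the thread; signatures are assumptions.
    def get_repos(self) -> list[Repo]: ...
    def get_repos_cached(self) -> list[Repo]: ...
    def search_repositories(self, query: str) -> list[Repo]: ...


def get_repositories(
    client: RepoClient,
    query: str | None = None,
    accessible_only: bool = False,
    use_cache: bool = False,
) -> list[Repo]:
    # No query: list everything the installation can access, always fresh.
    if not query:
        return [r for r in client.get_repos() if not r.get("archived")]

    # Query without accessible_only: defer to the search path unchanged.
    if not accessible_only:
        return client.search_repositories(query)

    # Query + accessible_only: filter the installation's repos locally,
    # reading from the cache only when the caller opted in.
    repos = client.get_repos_cached() if use_cache else client.get_repos()
    needle = query.lower()
    return [
        r
        for r in repos
        if not r.get("archived") and needle in r["full_name"].lower()
    ]
```

Existing callers that never pass `use_cache` keep their current behavior, which is the main argument for making the flag default to `False`.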

wedamija · 2026-04-09T17:18:59Z

src/sentry/integrations/github/integration.py

+        # accessible_only: fetch and filter accessible repos (cached)
+        # avoids re-fetching all pages on every debounced keystroke and
+        # avoids the Search API's 30 req/min shared rate limit.


This is specific to one part of the frontend, I'd probably remove this from here, or move it to the api

wedamija · 2026-04-09T17:19:19Z

src/sentry/integrations/github/integration.py

+                if not r.get("archived") and query_lower in r["full_name"].lower()
+            )
+
+        # Query without accessible_only: existing search behavior


Let's remove this comment

wedamija · 2026-04-09T17:19:41Z

src/sentry/integrations/github/integration.py

-                repos = [r for r in repos if query_lower in str(r["identifier"]).lower()]
-            return repos

+        # No query: fetch all accessible repos (without cache)


We can remove this comment too

wedamija · 2026-04-09T20:13:37Z

src/sentry/integrations/github/integration.py

+                r
+                for r in all_repos_cached
+                if not r.get("archived") and query_lower in r["full_name"].lower()
+            )


Ok, discussed on slack and I understand what's going on a bit better here now.

Yeah, I think it still makes sense to have use_cache be explicit. Then whenever is_accessible is true, we can optionally use the cache or not based on it, and just have assert not use_cache or not is_accessible to make sure that we don't confuse folks who use is_accessible.

Probably it'd be nice to get rid of is_accessible completely but idk if we're relying on the behaviour implicitly somewhere.

Add explicit use_cache parameter to get_repositories instead of implicitly tying caching to the accessible_only flag. This makes caching an independent concern that callers opt into explicitly.

Rename method and cache key to reflect that the cache is not specific to the accessible_only path. Remove implementation-detail comments from get_repositories.

sentry-warden · 2026-04-09T21:16:32Z

src/sentry/integrations/github_enterprise/integration.py

        query: str | None = None,
        page_number_limit: int | None = None,
        accessible_only: bool = False,
+        use_cache: bool = False,


use_cache parameter is accepted but ignored in GitHub Enterprise integration

The use_cache parameter was added to the method signature for interface compliance but is never used in the implementation. When accessibleOnly=true is passed to the repos endpoint, the endpoint calls get_repositories(..., use_cache=True) (line 76 of organization_integration_repos.py). The GitHub integration correctly calls client.get_repos_cached() when use_cache=True, but the GitHub Enterprise integration silently ignores this flag and always calls get_client().get_repos() without caching. GitHub Enterprise users will not receive the caching benefit this PR is intended to provide.

Verification

Read github_enterprise/integration.py lines 224-254 to confirm use_cache parameter is never referenced in method body. Read github/integration.py lines 355-376 to see correct implementation using use_cache. Read organization_integration_repos.py line 76 to confirm use_cache=accessible_only is passed to all integrations.

Identified by Warden sentry-backend-bugs · CHR-VP4

Fix attempt detected (commit 3bf81a4)

The use_cache parameter was added to the GitHub Enterprise get_repositories signature for interface compliance, but the method implementation still ignores it and unconditionally calls get_client().get_repos() without any conditional caching logic, identical to the before state in the critical execution paths.

The original issue appears unresolved. Please review and try again.

Evaluated by Warden

This change is siloed to GitHub only, not GHE.

Avoid caching the initial no-query load when accessibleOnly is set. Cache is only useful for debounced keystroke searches, not the first page load which should always return fresh data.

jaydgoss added 2 commits April 8, 2026 16:59

github-actions bot added the Scope: Backend label Apr 8, 2026

vercel bot deployed to Preview April 8, 2026 22:30 View deployment

jaydgoss added 2 commits April 8, 2026 17:57

chore(github): Remove debug logging from repo cache

c06ce7a

vercel bot deployed to Preview April 8, 2026 23:00 View deployment

ref(github): Extract to_repo_info helper to DRY up get_repositories

0261ee2

vercel bot deployed to Preview April 8, 2026 23:06 View deployment

fix(github): Only filter archived repos from installation responses

b8dd4e2

The Search API does not return archived repos, so the archived filter should only apply to the /installation/repositories paths.

vercel bot deployed to Preview April 8, 2026 23:11 View deployment

docs(github): Clarify comments in get_repositories

434bdcb

vercel bot deployed to Preview April 8, 2026 23:17 View deployment

ref(github): Reorder get_repositories and combine filters

1589a68

Move no-query path first since accessible_only is only useful with a query (repeated keystrokes). Combine archived and query filters into a single pass through to_repo_info.

vercel bot deployed to Preview April 8, 2026 23:30 View deployment

fix(github): Use Iterable instead of Sequence for generator args

d642bfb

vercel bot deployed to Preview April 8, 2026 23:48 View deployment

perf(github): Cache only required fields for accessible repos

a594905

Strip raw GitHub repo dicts down to the 5 fields used by get_repositories before storing in the cache. Reduces per-integration cache size from ~3KB per repo to ~100 bytes.

vercel bot deployed to Preview April 8, 2026 23:50 View deployment

ref(github): Add CachedRepo TypedDict for cached repo shape

c0987be

vercel bot deployed to Preview April 8, 2026 23:56 View deployment

ref(github): Use django cache instead of sentry default_cache

5b01fd6

getsentry configures CACHES with memcached in production, so django.core.cache.cache works and matches the pattern used by the rest of the integrations codebase.

vercel bot deployed to Preview April 9, 2026 00:02 View deployment

ref(github): Use explicit field picks instead of dict comprehension

c074271

vercel bot deployed to Preview April 9, 2026 00:05 View deployment

ref(github): Move CachedRepo TypedDict to module level

b24d6f5

vercel bot deployed to Preview April 9, 2026 00:09 View deployment

jaydgoss marked this pull request as ready for review April 9, 2026 15:42

jaydgoss requested a review from a team as a code owner April 9, 2026 15:42

wedamija reviewed Apr 9, 2026

evanpurkhiser approved these changes Apr 9, 2026

wedamija approved these changes Apr 9, 2026

ref(github): Decouple use_cache from accessible_only in get_repositories

da09de1

Add explicit use_cache parameter to get_repositories instead of implicitly tying caching to the accessible_only flag. This makes caching an independent concern that callers opt into explicitly.

jaydgoss requested review from a team as code owners April 9, 2026 21:06

vercel bot deployed to Preview April 9, 2026 21:08 View deployment

ref(integrations): Add use_cache param to all get_repositories overrides

68df5e6

vercel bot deployed to Preview April 9, 2026 21:10 View deployment

ref(github): Rename get_accessible_repos_cached to get_repos_cached

155c015

Rename method and cache key to reflect that the cache is not specific to the accessible_only path. Remove implementation-detail comments from get_repositories.

vercel bot deployed to Preview April 9, 2026 21:14 View deployment

sentry-warden bot reviewed Apr 9, 2026

fix(github): Only use cache when search query is present

3bf81a4

Avoid caching the initial no-query load when accessibleOnly is set. Cache is only useful for debounced keystroke searches, not the first page load which should always return fresh data.

vercel bot deployed to Preview April 9, 2026 21:24 View deployment

jaydgoss merged commit b691b5d into master Apr 10, 2026
61 checks passed

jaydgoss deleted the jaygoss/vdy-68-perfgithub-optimize-accessibleonly-repo-search-to-avoid branch April 10, 2026 16:08
